Abstract

In this paper, we present algorithms for integrating machine learning
algorithms into crowd-sourced databases for the purpose of acquiring
labeled data. The key observation is that there are a number of tasks for
which humans and machine learning algorithms can be comple- 
mentary, e.g., at labeling images where humans generally provide 
more accurate labels but are slow and expensive, while algorithms 
are usually less accurate but faster and cheaper. Based on this, we 
present two active learning algorithms designed to decide how to 
use humans and algorithms together in a crowd-sourced database. 
We look at two settings, namely the upfront and the iterative set- 
tings. In the upfront setting, we try to identify items that would 
be hard for algorithms to label, and ask humans to label them. In 
the iterative setting, we iteratively choose the best items to ask hu- 
mans to label and retrain the model after incorporating these so as 
to improve the quality of the classifier. We propose several dif- 
ferent algorithms in each of these settings, based on the theory of 
non-parametric bootstrap, which makes our results applicable to a 
broad class of machine learning models. We also look at a range 
of issues specific to crowds, such as the fact that crowd-generated 
labels can be incorrect, and that multiple crowd workers can be 
"batched" to perform labeling on several different items simulta- 
neously. Our results, on three data sets collected with Amazon's 
Mechanical Turk, and on 15 data sets from the UCI KDD archive, 
show that our methods on average ask one to two orders of magni- 
tude fewer questions than a random baseline, and two to eight times 
fewer questions than previous active learning schemes. 

1. Introduction 

Crowd-sourcing marketplaces like Amazon's Mechanical Turk
make it possible to recruit people to perform tasks that are difficult 
for computers, such as identifying objects in an image, summariz- 
ing a piece of text, or expressing an opinion about a product. There 
has recently been a great deal of interest in the database and HCI 
communities in integrating such human intelligence tasks (HITs) 
into data processing workflows. Examples include using HITs as 
part of workflows to support collaborative editing in a word
processor [5] or to facilitate tasks such as data cleaning, entity
resolution [35, 4], audio transcription, image annotation [6], and
sentiment analysis [25]. Many of these can be thought of as database
problems, where some of the labels on data items (or some of the 
data items altogether) are missing and need to be supplied by crowd 
workers. 

Although humans are often better than machines at tasks like ex- 
tracting sentiment and labeling images, using humans for process- 
ing tens of thousands to tens of millions of documents or images 
is not cost-effective. In this paper we advocate an alternative
approach: (i) use humans to facilitate the training of machine learning
algorithms such as classifiers, which can then be used to complete
these tasks at a greatly reduced time and cost, and (ii) identify the
ambiguous items that are inherently difficult for machine learning
algorithms, e.g., blurry images or sentences with a vague sentiment,
and ask the crowd about them.

The primary challenge in such a setting is determining which 
questions to ask from the crowd. Specifically, given a database of 
unlabeled data, and a labeling algorithm that can attach a label to 
each data item once it has been trained with hand-labeled data from 
the crowd, what is needed is a way to determine which and how 
many items we should ask the crowd to label for training purposes. 
There are two aspects to this problem: (i) determining what data 
would be more beneficial for training, and (ii) determining which 
data items will be hard for the classifier (no matter how much train- 
ing data it has). Given an ability to estimate these quantities, we 
need an algorithm to optimally allocate a time or dollar budget for 
crowd questions between these two types, in order to achieve the 
best cost or quality results. 

Although this is similar to the classical problem of active learn- 
ing in the machine learning community, where the goal is select- 
ing statistically optimal training data [10], we believe we are the 
first to consider the general version of this problem in the con- 
text of crowd-sourced databases. Developing active learning algo- 
rithms for crowd-sourced databases introduces a number of new 
challenges that were generally not faced in traditional active learn- 
ing literature, specifically: 

• Higher degree of noise. Many traditional active learning al- 
gorithms deal with expert-provided labels that are assumed to be 
the ground-truth (notable exceptions include agnostic active learn- 
ing approaches [3]). In contrast, crowd-provided labels are sub- 
ject to a much higher degree of uncertainty, e.g., innocent errors 
incorporated due to typos or abbreviations, lack of enough domain 
knowledge, and deliberate mistakes from spammers, none of which 
have been a great source of concern in active learning with
expert-provided labels [29].

• Generality. The active learning literature has mostly dealt 
with learning problems where a specific task is addressed in a
well-understood domain (e.g., for SVMs in text mining [31]). While
there are active learning techniques that are designed for generic 
classifiers, this generality requirement is much more stringent in a 
database setting: users can issue a wide range of queries, crowd 
workers can exhibit a wide range of behaviors, and the system will 
be most useful if it can support general classification algorithms 
that users supply with their queries. 

• Scalability. Many crowd-sourcing scenarios involve web-scale 
data: making sense out of tens of millions of daily tweets, images,
videos and blog posts. One aspect of scalability is limiting the
number of questions that must be asked of the crowd. Another aspect
is that training classifiers can be quite slow, especially if the
training process needs to be repeated after every new labeled item that
the crowd provides. Thus, we need active learning algorithms that 
can scale to very large datasets, and ideally be able to scale across 
many cores and machines. 

• Ease-of-use. Database systems are used by a general class of 
users. As a result, our integrated learning algorithms need to be 
simple to use, and able to automatically handle a number of impor- 
tant optimization questions, e.g., when to stop the active learning 
process? What is the required degree of redundancy in acquiring 
labels? What is the optimal batch size to meet the time/quality/cost 
requirements of a user? 

Our solution to these problems consists of two new active learn- 
ing algorithms, called Uncertainty and MinExpError. Uncertainty 
is suited to an online/iterative setting, where we are seeking labels 
for one or a few items at a time, while MinExpError is more appro- 
priate in an upfront setting, where we are using the crowd to label a 
set of items while concurrently running a classifier to label the rest 
of the items (upfront is appropriate when classifiers and crowds 
perform very differently on different subsets of items). These algo- 
rithms can be used in a broad class of categorization/classification 
tasks to adaptively choose the unlabeled questions that need to be 
asked from the crowd. They are both based on non-parametric 
bootstrap [13], which produces estimates of uncertainty for an
estimator (e.g., parameters learnt by a machine learning classifier)
by repeatedly re-computing the estimator on bootstrap replicates 
(samples drawn with replacement from the training data) and mea- 
suring the variability of the estimator across these replicates. Using 
these bounds, we can decide which items to ask the crowd about. 
We also develop a novel ILP-based algorithm to decide how many 
redundant labels to request for each data item, based on an esti- 
mate of the probability that individual crowd workers will mislabel 
particular classes of items. 

With the exception of [28] (which works only for probabilistic
classifiers), no attention has been given to exploiting the power of
bootstraps in active learning methods, mainly because of the
computation overhead of bootstraps. But with recent advances in
parallelizing bootstrap computation [20] as well as the rapid increase in
RAM sizes and the number of cores on modern servers, bootstrap 
has become a much more computationally viable approach. 

We demonstrate the effectiveness of our algorithms on 5 crowd- 
sourced and 15 well-known real-world data sets. The experiments 
show that compared to choosing items to label at random, our al- 
gorithms on average make 7x fewer label requests in the iterative 
scenario and more than 100x fewer requests in the upfront scenario.
Also, compared to previous state-of-the-art active learning
techniques, our algorithms ask 2 to 12x fewer questions on average.

In the next section, we present our high-level approach and
background on active learning, before introducing our proposed
algorithms in Section 3, which is followed by our crowd
database-specific optimizations in Section 4 and our extensive
empirical evaluations in Section 5. Finally, we review the related
work and conclude in Sections 6 and 7.



2. Overview of Approach 

Our method is based on active learning (a.k.a. closed-loop learning,
query learning, or optimal experimental design in the statistics
literature). This area of machine learning addresses settings where
labeled instances are scarce or expensive to obtain, e.g.,
audio/video transcription where, according to reports, word-level
annotation of each minute of speech takes ten minutes, and annotating
phonemes can take 400 times longer (i.e., almost seven hours) [37].
In this setting, the training algorithm is allowed to query for labels
of unlabeled datapoints. A better active learning strategy makes fewer
label requests to reach the same level of accuracy.

2.1 Basic Algorithms 

Suppose we have a budget (money, time, or number of questions) 
B for asking questions or a quality requirement Q (e.g., required
accuracy or F1-measure¹) for our classifier that we need to achieve.
Given a set of unlabeled points, we have some basic strategies to 
query labels from the crowd. First in the upfront strategy, we train 
our model on the available labeled data, and based on this select 
unlabeled points for which the crowd is queried. The final classifier 
takes as an input the union of these two sets. Alternatively, in the 
iterative approach, we use the crowd to label a few points, and add 
those labels to the existing training set, retrain, choose a new set 
of unlabeled points, and then iterate until we have exhausted our 
budget or met our quality goal (e.g. by using cross-validation on 
the training data). The upfront scheme clearly requires less training 
effort because we do not have to repeatedly retrain our model and 
recompute the scores on the unlabeled points, and then wait for the 
crowd's response. However, the iterative scheme can adaptively 
adjust its scores in each iteration, thus having a better chance than 
the upfront scheme of reaching a smaller error for the same budget. 
In contrast, the upfront scheme has to commit to all the labels it 
wants at once, based only on an initial (limited) set of labeled data. 
Pseudocode for these two algorithms is given in Figures 1 and 2.

Upfront Active Learning(Q or B, L0, U, θ, R, S)

Input:  Q is the quality requirement (minimum accuracy, F1-measure),
        B is the total budget (money, time, or # of questions),
        L0 is the initial labeled data,
        U is the unlabeled data,
        θ is a classification algorithm to (imperfectly) label the data,
        R is a ranker that gives effectiveness scores to unlabeled instances,
        S is a selection strategy (which labels need to come from crowd).
Output: L is the labeled version of U

CL ← ∅                      // labeled data acquired from the crowd
L ← θ_L0(U)                 // train θ on L0 & invoke it to label U
W ← R(U, θ)                 // w_i ∈ W is the effectiveness score for u_i ∈ U
If under a certain budget B
    Choose U' ⊂ U based on selection strategy S(U, W) and budget B
    CL ← Ask the crowd to label U'
    ML ← θ_L0(U \ U')       // train θ on L0 to label remaining U \ U'
Else                        // under a certain quality requirement Q
    Choose U' ⊂ U based on S(U, W) such that θ_L0(U') satisfies Q
    ML ← θ_L0(U')
    CL ← Ask the crowd to label the remaining U \ U'
L ← CL ∪ ML                 // combine crowd and machine provided labels
Return L

Figure 1: The upfront scenario in active learning. 

Note that the upfront scenario is not a special case of the iterative 
one. In other words, the upfront setting is different from the itera- 
tive setting because in the upfront setting we do not get to include 
the crowd-sourced labels when training the classifier. This differ- 
entiation is important as some applications may prefer the upfront 
scenario over the iterative one. For instance, when early results are 
strictly preferred, the model has to be invoked as soon as possible 
to return labels for as much of the data as possible, without having 



¹ The F1-measure is the harmonic mean of precision and recall and is
frequently used to assess the quality of a classifier. For a binary
classifier, precision is the fraction of the positively labeled items
that actually belong to the positive class, and recall is the fraction
of the actually positive items that were labeled as positive.



Iterative Active Learning(Q or B, L0, U, θ, R, S)

Input:  Q is the quality requirement (minimum accuracy, F1-measure),
        B is the total budget (money, time, or # of questions),
        L0 is the initial labeled data,
        U is the unlabeled data,
        θ is a classification algorithm to (imperfectly) label the data,
        R is a ranker that gives effectiveness scores to unlabeled instances,
        S is a selection strategy (which labels need to come from crowd).
Output: L is the labeled version of U

CL ← ∅                      // labeled data acquired from the crowd
L ← θ_L0(U)                 // train θ on L0 & invoke it to label U
While L's quality does not meet Q and our budget B is not exhausted:
    W ← R(U, θ)             // w_i ∈ W is the effectiveness score for u_i ∈ U
    Choose U' ⊂ U by applying the selection strategy S on the w_i scores
    L' ← Ask the crowd to label U'
    CL ← CL ∪ L', U ← U \ U'    // remove crowd-sourced labels from U
    L ← CL ∪ θ_(L0 ∪ CL)(U)     // train θ on L0 ∪ CL to label remaining U
Return L



Figure 2: The iterative scenario in active learning. 

to wait for the crowd's answer. Another example is when strin- 
gent requirements are imposed as to only allow for gold standard 
data to be used for training (whereas crowd-sourced labels are often 
noisy). 
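The control flow of the iterative scenario (Figure 2) can be sketched in Python as follows. This is only an illustrative sketch for the budget-limited case: the callables `train`, `rank`, `select` and `ask_crowd` are hypothetical placeholders for the classifier θ, the ranker R, the selection strategy S and the crowd interface, and the quality-based stopping test is omitted.

```python
def iterative_active_learning(labeled, unlabeled, train, rank, select,
                              ask_crowd, budget):
    """Budget-limited iterative loop (sketch, not the authors' code).

    labeled   -- list of (item, label) pairs, the initial data L0
    unlabeled -- list of hashable items, the set U
    train     -- hypothetical: list of pairs -> model, where model(item) -> label
    rank      -- hypothetical: (model, items) -> list of scores
    select    -- hypothetical: (items, scores) -> items to send to the crowd
    ask_crowd -- hypothetical: item -> label supplied by the crowd
    budget    -- maximum number of crowd questions
    """
    crowd_labels = {}              # CL in Figure 2
    remaining = list(unlabeled)    # the shrinking set U
    while remaining and budget > 0:
        # Retrain on L0 plus everything the crowd has labeled so far.
        model = train(labeled + list(crowd_labels.items()))
        scores = rank(model, remaining)
        batch = select(remaining, scores)[:budget]
        if not batch:
            break
        for item in batch:
            crowd_labels[item] = ask_crowd(item)
        budget -= len(batch)
        remaining = [u for u in remaining if u not in crowd_labels]
    # Final pass: the machine labels whatever the crowd did not.
    model = train(labeled + list(crowd_labels.items()))
    result = dict(crowd_labels)
    for u in remaining:
        result[u] = model(u)
    return result
```

The upfront scenario corresponds to running the body of the loop once and training only on the initial labeled data.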

3. Algorithms 

Based on the criteria of generality, scalability and effectiveness
outlined in the introduction, we have developed two bootstrap-based
methods to estimate the benefit (in terms of improving the model's
accuracy once the label is known) of having the crowd label an
unlabeled data point in our input dataset. We use these two scores
for the ranker (R in Figures 1 and 2). Once we have the scores, we
use biased sampling (that favors high scores) to choose a batch of b
unlabeled data points to be sent to the crowd for labeling; here the
probability of choosing each unlabeled instance is proportional to
its score.
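The biased sampling step can be sketched as below. The text does not say whether a batch is drawn with or without replacement; this sketch assumes distinct items are wanted and therefore draws sequentially, each time with probability proportional to the score among the items still available. The function name `sample_batch` is illustrative.

```python
import random

def sample_batch(scores, b, rng=None):
    """Pick up to b distinct item indices, favoring high scores.

    Assumption: items with a zero score are never selected, and an item
    is removed from the pool once it has been chosen.
    """
    rng = rng or random.Random()
    remaining = {i: s for i, s in enumerate(scores) if s > 0}
    batch = []
    while remaining and len(batch) < b:
        ids = list(remaining)
        weights = [remaining[i] for i in ids]
        # random.choices draws proportionally to the given weights.
        chosen = rng.choices(ids, weights=weights, k=1)[0]
        batch.append(chosen)
        del remaining[chosen]
    return batch
```

For example, `sample_batch([0.0, 0.25, 0.0, 0.21], b=2)` can only return items 1 and 3, in either order.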

Both of our scores are based on the nonparametric bootstrap [13],
a powerful statistical technique traditionally used for estimating the
uncertainty of estimators. The main idea is simple and is based on
the "plug-in" principle: use the empirical distribution of the data
(say $\hat{D}$) as a proxy for the true underlying unknown distribution
(denoted by $D$). If we observe $n$ i.i.d. sampled items
($\{x_1, \ldots, x_n\}$) drawn from an unknown probability distribution
$D$, then the empirical distribution $\hat{D}$ is defined as a discrete
distribution which puts a probability mass of $1/n$ on each value
$x_i$, $i = 1, \ldots, n$. Say we want to estimate² a parameter
$\theta = t(D)$. The "plug-in" principle simply estimates $\theta$ by
$\hat{\theta} = t(\hat{D})$. In particular, bootstrap can be used to
compute the bias and the variance of $\hat{\theta}$. The beauty of
bootstrap is that it computes these quantities automatically for a
wide class of estimators $t$. First we create $m$ bootstrap datasets
$L^*_k$ for $k = 1, \ldots, m$, where each
$L^*_k = \{x^*_{k1}, \ldots, x^*_{kn}\}$ is constructed by drawing $n$
i.i.d. items with replacement from the observed dataset (note that
these are the same size as the observed dataset). For each sample, we
compute $\hat{\theta}^*_k$. Now the variance of $\hat{\theta}$ is
estimated by the sample variance of
$\{\hat{\theta}^*_k, k = 1, \ldots, m\}$. The bias can also be
computed by a simple plug-in principle.
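As a concrete illustration of the resampling procedure, the following minimal sketch estimates the variance of an arbitrary estimator by recomputing it on m bootstrap replicates. The function name `bootstrap_variance` and the default m are illustrative choices, not taken from the paper.

```python
import random
import statistics

def bootstrap_variance(data, estimator, m=200, rng=None):
    """Estimate the variance of estimator(data) with m bootstrap replicates."""
    rng = rng or random.Random()
    n = len(data)
    replicates = []
    for _ in range(m):
        # One replicate: n draws with replacement from the observed data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    # Sample variance of the estimator across the replicates (needs m >= 2).
    return statistics.variance(replicates)
```

Calling `bootstrap_variance(values, statistics.mean)` approximates the variance of the sample mean without assuming any particular distribution for `values`.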

The main advantage of bootstrap is two-fold: first, uncertainty
can be estimated without making any simplifying assumption about
the underlying distribution and second, the estimator can be
arbitrary, as long as some smoothness assumptions are satisfied.
Therefore, by revisiting similar powerful theoretical results from
classical nonparametric statistics, not only can we estimate the
uncertainty of arbitrarily complex estimators, but we can also scale the
computation up to big volumes of data for the following reasons.
First, being able to estimate the uncertainty of an estimator (e.g.,
probability of correctly classifying a point) allows us to stop
invoking the crowd once we are confident enough. Second, each sample
$L^*_k$ can be processed independently, in parallel, allowing us to
parallelize the computation across multiple processors or cores.

3.1 Ranking Algorithms 

Next, we describe our ranking algorithms (scores), called Uncer- 
tainty and MinExpError, which both use bootstrap to decide which 
item to ask about next. 

3.1.1 Minimizing the Uncertainty 

The goal of our Uncertainty algorithm is to estimate the confidence
(or uncertainty) of a given classifier $\theta$ on each of the
unlabeled items.³ Suppose, for the moment, that we have the underlying
distribution $\mathcal{D}$ over all the items and their true labels.
Let the true label of item $u$ be $l_u$. Having access to
$\mathcal{D}$, we could easily draw many datasets
$\{L_i, i = 1, 2, \ldots\}$ from distribution $\mathcal{D}$, each of the
same size as our original training data; then train our classifier on
each of these $L_i$'s and predict the label of $u$, denoted as
$\theta(L_i, u)$. Now by definition, the true variance of this
classifier on $u$ would be simply the variance of these labels.
Unfortunately, in reality $\mathcal{D}$ is not available to us. Hence,
we use the empirical distribution instead. In other words, we
bootstrap the training data multiple (say $m$) times to obtain $m$
different classifiers which are then invoked to generate the label of
a given test datapoint. Since the bootstrap samples simply emulate the
process of sampling data points from the original distribution, these
bootstrapped labels can be thought of as labels given by classifiers
learnt on different training datasets directly sampled from the true
distribution. We can then compute the variance of these $m$
bootstrapped labels. By bootstrap theory [13], this is guaranteed to
quickly converge to the true variance as we increase $m$. For example,
in our experiments we found that $m = 10$ provides good estimates of
the variance (we use $m = 10$ in all our experiments in Section 5).

From a classification point of view, the variance of the classifier's
answer about a given item is important because the higher it is, the
more likely the classifier is to mislabel the item. Therefore, once
we compute the variance for each unlabeled item, we select items
with larger variance (the probability of an item being chosen will
be proportional to its variance) and ask the crowd to label them.
The number of items selected in each iteration and the number of
iterations depend on the budget, user requirements and whether
we are operating in the upfront or iterative scenario.⁴

Formally, our notion of uncertainty is simply the variance of the 
classifier in its predictions, as follows. 

Let $L^*_k$ denote the $k$-th bootstrap, and
$l^k_u := \theta(L^*_k, u)$ be the prediction of our classifier for
$u$ when trained on this bootstrap. Also, let
$R(u) := \sum_{k=1}^{m} l^k_u / m$. Since $l^k_u \in \{0, 1\}$, the
uncertainty score for instance $u$ is given by its variance, which is:

$$\mathrm{Uncertainty}(u) = R(u)\,(1 - R(u)) \tag{1}$$
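Equation (1) can be computed with a few lines of code. The sketch below is illustrative only: `train_centroid` is a toy one-dimensional nearest-class-mean classifier introduced here as a stand-in for an arbitrary classifier, and the function names are hypothetical.

```python
import random

def train_centroid(train):
    """Toy stand-in classifier (an assumption, not from the paper):
    predict the class whose mean feature value is closest."""
    means = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        if xs:
            means[label] = sum(xs) / len(xs)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def uncertainty_scores(labeled, unlabeled, train, m=10, rng=None):
    """Return R(u) * (1 - R(u)) for every unlabeled item u."""
    rng = rng or random.Random()
    n = len(labeled)
    votes = [0] * len(unlabeled)
    for _ in range(m):
        # Train one classifier per bootstrap replicate of the labeled data.
        replicate = [labeled[rng.randrange(n)] for _ in range(n)]
        predict = train(replicate)
        for j, u in enumerate(unlabeled):
            votes[j] += predict(u)      # binary labels in {0, 1}
    scores = []
    for v in votes:
        r = v / m                       # R(u): mean bootstrapped label
        scores.append(r * (1 - r))      # variance of a 0/1 variable
    return scores
```

Items on which the m classifiers split evenly receive the maximum score of 0.25, while items on which they all agree receive a score of 0.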



Note that the power of using bootstrap is that we estimated $R(u)$
and hence the variance without any assumptions on the classification
algorithm.⁵ Thus, while many active learning proposals have
tried to capture the uncertainty or variance of classifiers, they are
either model-class specific (e.g., define the uncertainty of an SVM as
the point's distance from the separator margin [32]), assume
probabilistic classifiers that produce highly accurate class probability
estimates [34], or are simply heuristics that do not guarantee an
unbiased estimate of the classifier's variance. In contrast, our
Uncertainty score applies to both probabilistic and non-probabilistic
classifiers, and is guaranteed by bootstrap theory to provide an
unbiased estimate of the variance.

² Here, we use the $\hat{\ }$ (hat) symbol for an estimate obtained
from the data.

³ Throughout this paper, to ease the presentation, we assume binary
classification, i.e., $\theta \in \{0, 1\}$. However, our algorithms
work for general classifiers.

⁴ These will be addressed in Sections 3.2 and 4.3. Here we only focus
on how to produce the scores that are used as sampling weights in
our selection strategy.

The intuition behind why our Uncertainty algorithm can reduce
the overall error with fewer labels is inspired by the results from
Kohavi and Wolpert [21], where they show that the classification
error for item $u$ can be decomposed into the sum of three terms:
(i) the classifier's variance, $\mathrm{var}_{KW}(u)$, (ii) the
classifier's bias, $\mathrm{bias}^2_{KW}(u)$, and (iii) a noise term,
$\sigma^2(u)$, which is an error inherent to the data collection
process, and thus cannot be reduced. Using our notation,
$\mathrm{var}_{KW}(u)$ will be $R(u)(1 - R(u))$, and the squared bias
is defined as $[f(u) - R(u)]^2$, where $f(u) = E[l_u \mid u]$, i.e.,
the expected value of the true label given $u$. Hence by asking
for the label of $u$ with large variance we are indirectly reducing
the classification error. However, this is not all we can do.
Bootstrap can also be used to reduce bias [13]. But, instead of
estimating both bias and variance separately using bootstrap, we can
directly estimate the classification error using bootstrap by simply
comparing the bootstrapped labels $l^k_u$ to the label generated by a
classifier trained on the original training data. This is used in our
second algorithm, described next.
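For the binary case, one can check that the three terms add up to the misclassification probability. The short derivation below is a sketch under two assumptions that are not stated in the text above: the prediction and the true label are independent given $u$, and the noise term is $\sigma^2(u) = f(u)(1 - f(u))$, the binary specialization of the Kohavi-Wolpert definition.

```latex
\begin{align*}
\Pr[\text{error on } u]
  &= f(u)\,(1 - R(u)) + (1 - f(u))\,R(u) \\
  &= f(u) + R(u) - 2 f(u) R(u), \\
\mathrm{var}_{KW}(u) + \mathrm{bias}^2_{KW}(u) + \sigma^2(u)
  &= R(u)(1 - R(u)) + [f(u) - R(u)]^2 + f(u)(1 - f(u)) \\
  &= f(u) + R(u) - 2 f(u) R(u).
\end{align*}
```

Both expressions agree, so lowering the variance term lowers the error whenever the other two terms are held fixed.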

3.1.2 Minimizing the Expected Error 

The Uncertainty score identifies items that are most ambiguous
or hardest for the classifier, and asks the crowd to label them instead.
However, it may also be a good idea to ask questions that the
classifier is not uncertain about, but which, if its answer about them
is incorrect, would have the largest impact on the classifier's output.
Hence we propose a new score which naturally combines both of
these strategies.

The idea is that we want to get a label for an item which is most
likely to minimize the overall training error. If we knew that the
current model's prediction, $\hat{l}_u$, was correct, then we could
estimate the training error by adding $\{u, \hat{l}_u\}$ to the
existing training data $L$ and retraining the model. Denote the error
of this model by $e_{right}$. On the other hand, if we knew that the
predicted label was incorrect, we could add $\{u, 1 - \hat{l}_u\}$ to
the training set and retrain the model. Denote the error of this model
by $e_{wrong}$. The problem of course is that (i) we do not know what
the true label is, and (ii) we do not know the exact error of each
model. Solving (ii) is relatively easy: we assume those labels and use
cross validation on our already labeled data to estimate both errors,
say $\hat{e}_{right}$ and $\hat{e}_{wrong}$. To solve the first
problem, we can again bootstrap our current training data to estimate
the probability that our current prediction ($\hat{l}_u$) is correct.

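Following the same notation used in Section 3.1.1, denote
$l^k_u := \theta(L^*_k, u)$, and let $l_u$ be the true label. Thus,
our goal is to estimate $p(u) := \Pr[\hat{l}_u = l_u \mid u]$, as
follows.

$$\hat{p}(u) = \frac{\sum_{k=1}^{m} \mathbf{1}(l^k_u = \hat{l}_u)}{m} \tag{2}$$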



Here, $\mathbf{1}(c)$ is the decision function which evaluates to 1
when condition $c$ holds and to zero otherwise. Intuitively, equation
(2) says that the probability of the model's prediction being correct
can be estimated by the fraction of classifiers that agree on that
prediction, if those classifiers are each trained on a bootstrap of the
training set.

Thus $\hat{p}(u)$ estimates the classification accuracy of the model
for point $u$. Now, we can compute the expected error of the model if
we incorporated $u$ in the training data (i.e., asked the crowd to label
it for us) by averaging over its label choices.



$$\mathrm{MinExpError}(u) = \hat{p}(u)\,\hat{e}_{right} + (1 - \hat{p}(u))\,\hat{e}_{wrong} \tag{3}$$
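Equations (2) and (3) can be combined into a short routine. This is a minimal sketch for binary labels: `train` and `estimate_error` are hypothetical callables, the latter standing for whatever cross-validation procedure is used to estimate the error of a model trained on a given labeled set.

```python
import random

def agreement_probability(labeled, u, train, m=10, rng=None):
    """Equation (2): fraction of bootstrap classifiers that agree with
    the prediction of the classifier trained on all labeled data."""
    rng = rng or random.Random()
    n = len(labeled)
    predicted = train(labeled)(u)
    agree = 0
    for _ in range(m):
        replicate = [labeled[rng.randrange(n)] for _ in range(n)]
        agree += int(train(replicate)(u) == predicted)
    return predicted, agree / m

def min_exp_error(labeled, u, train, estimate_error, m=10, rng=None):
    """Equation (3): expected error after adding u, averaged over
    whether the current prediction for u is right or wrong."""
    predicted, p = agreement_probability(labeled, u, train, m, rng)
    # Hypothetical error estimates for the two possible labels of u.
    e_right = estimate_error(train, labeled + [(u, predicted)])
    e_wrong = estimate_error(train, labeled + [(u, 1 - predicted)])
    return p * e_right + (1 - p) * e_wrong
```

Each call retrains the classifier m times for equation (2), plus whatever retraining `estimate_error` performs for the two label choices.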



⁵ The only requirement for consistency of bootstrap is that the
estimator $\theta$ be relatively smooth, which holds for almost all
existing learning algorithms [13].



Since we want to sample such that $u$ with a small MinExpError
score is more likely to be picked, we define the score from which
we will draw samples of $u$ as:

$$\mathrm{ExpTrainingAccuracy}(u) := 1 - \mathrm{MinExpError}(u) \tag{4}$$

Without loss of generality, assume that the quantity
$\hat{e}_{right} - \hat{e}_{wrong}$ is non-negative. An analogous
decomposition is possible when it is negative. We can break down the
left hand side of equation (4), i.e., ExpTrainingAccuracy, as:

$$\hat{p}(u)(1 - \hat{e}_{right}) + (1 - \hat{p}(u))(1 - \hat{e}_{wrong}) = (1 - \hat{p}(u))(\hat{e}_{right} - \hat{e}_{wrong}) + (1 - \hat{e}_{right}) \tag{5}$$
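The step between the two sides of equation (5) is a routine expansion, spelled out here for convenience with the argument $u$ dropped and with hats marking estimated quantities.

```latex
\begin{align*}
\hat{p}\,(1 - \hat{e}_{right}) + (1 - \hat{p})(1 - \hat{e}_{wrong})
  &= 1 - \hat{e}_{wrong} - \hat{p}\,(\hat{e}_{right} - \hat{e}_{wrong}) \\
  &= (1 - \hat{e}_{right}) + (\hat{e}_{right} - \hat{e}_{wrong})
     - \hat{p}\,(\hat{e}_{right} - \hat{e}_{wrong}) \\
  &= (1 - \hat{p})(\hat{e}_{right} - \hat{e}_{wrong}) + (1 - \hat{e}_{right})
\end{align*}
```

As a numeric check with made-up values $\hat{p} = 0.9$, $\hat{e}_{right} = 0.2$ and $\hat{e}_{wrong} = 0.1$, the left side is $0.9 \cdot 0.8 + 0.1 \cdot 0.9 = 0.81$ and the right side is $0.1 \cdot 0.1 + 0.8 = 0.81$.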

Note that the $1 - \hat{e}_{right}$ term is non-negative, and hence
$(1 - \hat{p}(u))(\hat{e}_{right} - \hat{e}_{wrong})$ is a lower bound
on ExpTrainingAccuracy. Thus this naturally combines both the hardness
of the question and how much knowing its answer can improve our
classifier. In other words, equation (5) tells us that if the question
is too hard (large $1 - \hat{p}(u)$), even if its answer does not
affect our ability in classifying other items, we may still choose to
ask the crowd to label it to avoid a high risk of misclassification on
it. On the other hand, we may ask a question for which our model is
fairly confident (small $1 - \hat{p}(u)$), but having its true label
can still make a big difference in classifying other items, namely,
the value of $\hat{e}_{wrong}$ is much smaller than $\hat{e}_{right}$.
This means that, however unlikely, if our model happens to be wrong,
we will have a higher overall accuracy and a lower overall error if we
ask for the true label of $u$.

Empirically, we found that a simple heuristic that adds some
constant $c > 0$ to ExpTrainingAccuracy($u$) for all $u \in U$ and
renormalizes this vector gives us a considerable edge over other
methods. We have used $c = 1$ in all of our experiments in this
paper. Intuitively, this is biasing the original distribution towards
the uniform distribution while preserving the peaks in the original
distribution. In Section 5 we show that our scoring algorithms deliver
a significant boost in quality over a passive learner, which picks
unlabeled points uniformly at random.
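The smoothing heuristic amounts to shifting every score by c and dividing by the new total. A minimal sketch follows; the function name is illustrative and the scores are assumed to be non-negative.

```python
def smoothed_sampling_weights(exp_training_accuracy, c=1.0):
    """Add a constant c > 0 to every score and renormalize to sum to 1."""
    shifted = [a + c for a in exp_training_accuracy]
    total = sum(shifted)
    return [s / total for s in shifted]
```

With scores `[0.0, 1.0]` and `c = 1`, the weights become 1/3 and 2/3: the zero-score item can now be sampled, yet the higher-scoring item remains twice as likely.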

3.2 Complexity and Scalability 

Training machine learning algorithms is typically a CPU inten- 
sive task. Training active learning algorithms is even more expen- 
sive, as some form of retraining is often required after each new 
example is received. In this section, we describe how our choice 
of bootstrap for active learning allows us to achieve scalability, 
while preserving the generality criterion of our models. The pri- 
mary benefit of bootstrap is that it is embarrassingly parallel. This 
means each bootstrap can be shipped to a different node or proces- 
sor, which can perform training in parallel. Only at the end will 
the output (labels) of each instance be sent to a single node that 
can perform some light weight integration of results. For instance, 
in the case of our Uncertainty algorithm, the final node will only 
have to estimate the variance among individual labels and perform 
a weighted sampling to decide on the next batch of questions to ask 
from the crowd. 
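The shape of this computation can be sketched as follows: each bootstrap is trained by a separate worker, and only the predicted labels are gathered for the final integration step. The sketch uses a thread pool purely as a stand-in for separate nodes or processes (threads in CPython do not speed up CPU-bound training), and `train` is a hypothetical callable.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def predict_on_replicate(args):
    """Worker task: train on one bootstrap replicate, label all of U."""
    labeled, unlabeled, train, seed = args
    rng = random.Random(seed)
    n = len(labeled)
    replicate = [labeled[rng.randrange(n)] for _ in range(n)]
    model = train(replicate)
    return [model(u) for u in unlabeled]

def parallel_uncertainty(labeled, unlabeled, train, m=10, workers=4):
    """Run m independent bootstrap trainings, then integrate the labels."""
    jobs = [(labeled, unlabeled, train, seed) for seed in range(m)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_replicate = list(pool.map(predict_on_replicate, jobs))
    # Lightweight integration on a single node: mean label per item,
    # then the variance R(u) * (1 - R(u)) from equation (1).
    means = [sum(column) / m for column in zip(*per_replicate)]
    return [r * (1 - r) for r in means]
```

Because the workers share nothing but the read-only input data, the same structure carries over to a process pool or a cluster scheduler.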



The time complexity of each iteration of our Uncertainty algorithm
is $O(m \cdot T(|U|))$, where $|U|$ is the number of unlabeled data
points in that iteration, $T(\cdot)$ is the training time of the
classification algorithm (e.g., this is cubic in the input size for
training SVMs), and $m$ is the number of bootstraps. Hence, in each
iteration, we only need $m$ nodes to achieve the same run-time as
training a single instance of the classifier. In Section 5.3 we study
the effect of changing $m$ on the overall quality of our active
learning algorithm.

Our other algorithm, MinExpError, is computationally more demanding
as it involves a case analysis for each unlabeled point. The
time complexity of each iteration of our MinExpError algorithm is
$O((m + |U|) \cdot T(|U|))$. Here the algorithm is still
embarrassingly parallel, since each unlabeled instance can be analyzed
in parallel. However, MinExpError will involve a higher number of
nodes, e.g., to achieve the same performance as Uncertainty, we need
$O(|U|)$ nodes in the cluster. Note that this additional overhead is
often justified since MinExpError is designed to deal with the upfront
learning scenarios that only involve one iteration, where the job of
the active learner is much more challenging. This is because the active
learner only gets to decide on the questions to ask based on the
limited set of initial labeled data and will not be able to adjust its
decision in future iterations. Therefore, it is often justified to spend
extra processing units for the upfront scenario in order to make
better decisions that would lead to better cost savings overall,
compared to using a less careful algorithm that would lead to asking
many more, less effective questions of the crowd.

4. Optimizing for the Crowd 

In this section, we describe several approaches we explored to 
deal with issues presented by performing active learning in a crowd 
setting and in a crowd-sourced database. Specifically, we look at 
the following questions: How do we deal with the fact that the
crowd doesn't always answer questions correctly (Section 4.1)?
How do we know when our accuracy is "good enough" (Section 4.2)?
And given that we can issue many queries in parallel to the crowd,
what is the effect of batch size (number of simultaneous questions)
on our learning performance (Section 4.3)? We deal with each of
these questions in the next three subsections.

4.1 Handling Crowd Uncertainty 

Optimizing Redundancy for Subgroups: Previous active learn- 
ing approaches (with a few exceptions) have assumed that labels 
are provided by domain experts and hence, are perfectly correct 
(see Section 6). In contrast, in a crowd database, the crowd may
produce answers that are incorrect, noisy and sometimes even ad- 
versarial. The conventional way to handle this is to use redundancy, 
e.g., to ask multiple workers to provide answers, and combine their 
answers to get the best overall result. Standard techniques, such 
as asking for multiple answers and using majority vote or the tech- 
niques of Dawid and Skene (DS) [11] can improve answer qual- 
ity when the crowd is mostly correct, but will not help much if 
users disagree more than they agree or when they converge to the 
right answers but too slowly. In our experience, for some classi- 
fication datasets, crowd workers can be quite imprecise. For ex- 
ample, we took 1000 tweets with hand-labeled ("gold data") sentiment
(dataset details in Section 5.1.3), removed these labels and asked
Amazon Mechanical Turk workers to label them again and mea- 
sured the workers' agreement. We used different redundancy ra- 
tios (1, 3, 5) and different voting schemes (majority and Dawid and 
Skene (DS)) and computed the ability of the crowd to agree with 
the hand-produced labels. The results are shown in Table[T] 

Note that adding labels from 3 to 5 does not significantly increase 
the crowd's agreement in this case. 



Voting Scheme      1 worker per label   3 workers per label   5 workers per label
Majority Vote      (illegible)          70%                   70%
Dawid and Skene    (illegible)          69%                   70%



Table 1: The effect of redundancy (using both majority voting 
and Dawid and Skene voting) on the accuracy of crowd labels. 

A second, perhaps more important, observation is that crowd ac- 
curacy often varies for different subgroups of the unlabeled data. 
For example, we asked Mechanical Turk workers to label facial
expressions in the CMU Facial Expression dataset and measured
agreement with hand-supplied labels. This dataset consists of 585 
head-shots of faces of 20 users, each in 32 different combinations 
of different positions and facial expressions, where expressions are 
one of neutral, happy, sad or angry. We found that crowd perfor- 
mance was significantly worse when the faces were looking up ver- 
sus in other positions: 



Facial orientation   Avg. accuracy
straight             0.6335
left                 0.6216
right                0.6049
up                   0.4805



Similar patterns appear in several other data sets, with the crowd 
performing significantly worse for certain subgroups. To take ad- 
vantage of these two observations, we developed an algorithm that 
computes the optimal number of questions to ask about each subgroup,
by estimating the probability $p_g$ with which the crowd correctly
classifies items of a given subgroup $g$, and then solving an
integer linear program (ILP) to choose the optimal number of
questions (i.e., degree of redundancy) for labeling each item from
that subgroup, given these probabilities.

First, observe that when combining answers using majority voting
with an odd number of votes, say $2v+1$, for an unlabeled item
$u$ with a true label of $l$, the probability of the crowd's combined
answer $l^*$ being correct is the probability that at most $v$ workers
get the answer wrong. Denoting this probability by $P_{g,(2v+1)}$,
we have:

\[
P_{g,(2v+1)} = \Pr(l = l^* \mid 2v+1 \text{ votes})
  = \sum_{i=0}^{v} \binom{2v+1}{i} \, p_g^{2v+1-i} \, (1-p_g)^{i}
\tag{6}
\]

where $p_g$ is the probability that a crowd worker will correctly label
an item in group $g$.
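
Equation (6) is easy to evaluate numerically. The following Python sketch computes it under the same assumptions as the equation (workers err independently, and the combined answer is wrong only when more than $v$ of the $2v+1$ workers are wrong). The function name `majority_correct_prob` is ours and does not come from the paper.

```python
from math import comb


def majority_correct_prob(p_g: float, votes: int) -> float:
    """Evaluate Equation (6): probability that a majority vote is correct.

    p_g   -- probability that one worker labels an item of subgroup g correctly
    votes -- odd number of votes (2v + 1) collected for the item
    """
    if votes < 1 or votes % 2 == 0:
        raise ValueError("votes must be a positive odd integer")
    v = (votes - 1) // 2
    # i counts the wrong votes; the majority is right whenever i <= v.
    return sum(
        comb(votes, i) * p_g ** (votes - i) * (1.0 - p_g) ** i
        for i in range(v + 1)
    )
```

For a hypothetical $p_g = 0.7$, this gives 0.7, 0.784, and about 0.837 for 1, 3, and 5 votes. For $p_g < 0.5$ the value decreases as votes are added, which is why redundancy only helps when the crowd is mostly correct.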

Next, we describe our algorithm, called Partitioning Based Allocation
(PBA), which partitions the items into subgroups and optimally
allocates the budget to different subgroups, by computing the
optimal number of votes per item, $V_g$, for each subgroup $g$. PBA
consists of three steps:

1. Partition the dataset into $G$ subgroups. This can be done either
by partitioning on some low-cardinality field that is already
present in the dataset to be labeled (for example, in an image
recognition dataset, we might partition by the user who took the picture,
or the time the picture was shot, e.g., at night or day), or by using an
unsupervised clustering algorithm such as k-means. For instance,
in the CMU facial expression dataset, we partitioned the images
based on user IDs, leading to $G = 20$ subgroups, each with roughly
32 images.

Footnote (CMU Facial Expression dataset): http://kdd.ics.uci.edu/databases/faces/faces.data.html

2. Randomly pick $n_0 > 1$ different data items from each
subgroup, and obtain $v_0$ labels for each one of them. Estimate $p_g$ for
each subgroup $g$, either by choosing data items for which the label
is known and computing the fraction of labels that are correct, or
by taking the majority vote for each of the $n_0$ items, assuming it is
correct, and then computing the fraction of labels that agree with
the majority vote. For example, for the CMU dataset, we asked for
$v_0 = 9$ labels for $n_0 = 2$ random images from each subgroup, and
hand-labeled those $n_0 \cdot G = 40$ images to estimate the crowd's $p_g$ for
$g = 1, \ldots, 20$.

3. Solve an ILP to compute $V_g$ for all groups. Suppose we are
given an overall budget $B$ of questions we can ask. We want to
allocate that amongst our $G$ subgroups optimally. We use $b_g$ to
denote the budget allocated to subgroup $g$, and create a binary
indicator variable $x_{g,b}$ whose value is 1 iff subgroup $g$ is allocated a
budget of $b$. Further suppose that our learner has chosen to try to
label $f_g$ items from each subgroup $g$.

We can then formulate an ILP with the objective function that seeks
to minimize:

\[
\sum_{g=1}^{G} \sum_{b=1}^{b_{\max}} x_{g,b} \cdot (1 - P_{g,b}) \cdot f_g
\tag{7}
\]

where $b_{\max}$ represents the maximum number of votes that we are
willing to ask per item. This objective function captures the
expected weighted error of the crowd, i.e., it has a lower value when
we allocate a larger budget ($x_{g,b} = 1$ for a large $b$ when $p_g > 0.5$)
to subgroups whose questions are harder for the crowd ($P_{g,b}$ is
small) or for which the learner has asked for more items
to be labeled ($f_g$ is large).

This optimization is subject to the following constraints:

\[
\forall \, 1 \le g \le G: \quad \sum_{b=1}^{b_{\max}} x_{g,b} = 1
\tag{8}
\]

\[
\sum_{g=1}^{G} \sum_{b=1}^{b_{\max}} x_{g,b} \cdot b \cdot f_g \le B
\tag{9}
\]

Here, constraint (8) ensures that we pick exactly one $b$ value for
each subgroup, and constraint (9) ensures that we stay within our
overall item labeling budget.

Figure 3: Reducing the crowd's noise through optimal allocation
of the budget to different partitions of the items
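
The allocation problem defined by objective (7) and constraints (8) and (9) can be illustrated without an ILP solver. The Python sketch below is not the solver-based approach described above: it swaps in a dynamic program over the budget, which finds an optimal solution to the same problem when budgets are integers. It assumes only odd vote counts are tried (Equation (6) is stated for an odd number of votes), and the names `allocate_votes`, `p`, and `f` are ours.

```python
from math import comb


def majority_correct_prob(p_g, votes):
    """Equation (6) for an odd number of votes."""
    v = (votes - 1) // 2
    return sum(comb(votes, i) * p_g ** (votes - i) * (1.0 - p_g) ** i
               for i in range(v + 1))


def allocate_votes(p, f, budget, b_max):
    """Choose the number of votes per item, b_g, for every subgroup g.

    Minimizes  sum_g f[g] * (1 - P_{g,b_g})        (objective 7)
    subject to sum_g b_g * f[g] <= budget          (constraint 9)
    with exactly one b_g per subgroup              (constraint 8).
    Assumption: only odd b_g in 1..b_max are considered.
    Returns (allocation, expected_error).
    """
    choices = range(1, b_max + 1, 2)
    # best[c] = (error, allocation) with the lowest error among plans that
    # spend exactly c questions on the subgroups processed so far.
    best = [None] * (budget + 1)
    best[0] = (0.0, [])
    for p_g, f_g in zip(p, f):
        nxt = [None] * (budget + 1)
        for spent, plan in enumerate(best):
            if plan is None:
                continue
            for b in choices:
                cost = spent + b * f_g
                if cost > budget:
                    break  # larger b only costs more
                err = plan[0] + f_g * (1.0 - majority_correct_prob(p_g, b))
                if nxt[cost] is None or err < nxt[cost][0]:
                    nxt[cost] = (err, plan[1] + [b])
        best = nxt
    feasible = [plan for plan in best if plan is not None]
    if not feasible:
        raise ValueError("budget too small to give every item one vote")
    error, allocation = min(feasible, key=lambda plan: plan[0])
    return allocation, error
```

As a hypothetical example, `allocate_votes([0.9, 0.6], [10, 10], 60, 9)` returns 3 votes per item for both subgroups, while a subgroup with $p_g < 0.5$ is always given a single vote, since extra votes would only raise its expected error.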

We used this PBA algorithm on the CMU facial expressions dataset.
We observed that the crowd had a particularly hard time telling the
facial expression of some of the individual faces and had an easier
time with others, so we created subgroups based on the user
column of the data set, and asked the crowd to label the expression
on each face. By choosing $n_0 = 2$, $v_0 = 9$, and $b_{\max} = 9$
(as mentioned above), we ran PBA and compared its performance
to a uniform budget allocation scheme where the same number of
questions are asked about all items uniformly, as done in previous
research (see [6]). The results are shown in Figure 3. Here, the X axis
shows the normalized budget, e.g., a value of 2 means the budget
was twice the total number of unlabeled items. The Y axis shows
the overall (classification) error of the crowd using majority voting
under different allocations. The solid lines show
the actual error achieved under both strategies, while the blue and
green dotted lines show our estimates of their performance before
running the algorithms. From Figure 3 we can see that even though
our estimates of the actual error are not highly accurate, since we
only use them to solve an ILP that would favor harder subgroups,
our PBA algorithm (solid green) can still reduce the overall crowd
error by about 10 percentage points (from 45% to 35%). We also show
how PBA would perform if it had an oracle that provided access to
exact values of $P_{g,b}$ (red line).

Figure 4: Improving the crowd's F1-measure for entity resolution
in the iterative scenario.

Balancing Classes: Our final observation about crowd's accuracy 
is that crowd workers perform better at classification tasks when 
the number of instances from each class is relatively balanced. For 
example, given a face labeling task where the goal is to tag each 
face as "man" or "woman", people perform worse at labeling the 
rarer class when that class is very infrequent. For instance, if only
0.1% of images are of men, then crowd workers will have a high
error rate when labeling men (since they become conditioned to
answer "woman"). This effect has been documented in other
crowd-sourced settings as well [22].

Interestingly, both our Uncertainty and MinExpError algorithms 
naturally tend to increase the fraction of labels they obtain for rare 
classes. This is because they tend to have more uncertainty about 
items with rare labels (due to insufficient examples in their training 
set), and hence are more likely to ask users to label those items. 
Thus, our algorithms naturally improve the "balance" in the ques- 
tions they ask about different classes, which in turn improves crowd 
labeling performance. An example of this situation is reported in
Figure 4, where we report the crowd's F1-measure in an entity
resolution task under different active learning algorithms. The dataset
used in this experiment has far fewer positive instances (11%)
than negative ones (89%); for more details on this dataset,
see Section 5.1.1. The main observation here is that although the
crowd's average F1-measure for the entire dataset is 56% (this is
achieved by the baseline, which randomly picks items to be labeled),
our Uncertainty algorithm can lift this up to 62%, which is mainly
due to the fact that questions posed to the crowd had a more
balanced mixture of positive and negative labels.

Figure 5: Predicting the model's quality using k-fold cross
validation.

4.2 When To Stop Asking 

Users may have a fixed budget, or want to learn a model that 
achieves a specific error level on one or more fidelity metrics (e.g., 
accuracy or F1-measure). Given a fixed budget, it is straightforward
to continue to ask questions until the budget is exhausted. 

To achieve a specific error level, a way to estimate the current 
error of the learned model is needed. The most straightforward way 
to do this is to learn a model based on the data collected so far, and 
measure its ability to accurately classify the gold data according 
to the desired metric. We can then continue to ask questions until 
a specific error level on gold data is achieved (or until the rate of
improvement in the error levels off).

In the absence of (sufficient) gold data, we adopt the standard 
method of k-fold cross validation, where we randomly partition 
the crowd-labeled data into test and training data, and measure the 
ability of a model learned on training data to predict test values. 
We repeat this procedure k times and average the error estimates 
to get an overall assessment of the model's quality. This method, 
according to our experiments, provides more reliable estimates of 
the model's current quality than estimating the training error on 
a small amount of gold data. Figure 5 shows the F1-measure of
an SVM model on the cancer dataset (see Section 5.2), and our
estimates using this technique, which are reasonably close, especially
as more labels are obtained from the crowd.
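
The k-fold estimate described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in the experiments: `train_fn` is a hypothetical callback that takes training items and labels and returns a prediction function, labels are binary with 1 as the positive class, and a fold with no true positives found is scored as F1 = 0.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    """F1-measure of the positive class (0.0 when there are no true positives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, q in pairs if t == positive and q == positive)
    fp = sum(1 for t, q in pairs if t != positive and q == positive)
    fn = sum(1 for t, q in pairs if t == positive and q != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_f1(items, labels, train_fn, k=5, seed=0):
    """Average F1 over k train/test splits of the crowd-labeled data."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k disjoint test sets
    scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [j for i, fold in enumerate(folds)
                     if i != held_out for j in fold]
        # Train on k - 1 folds, then predict the held-out fold.
        model = train_fn([items[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        predictions = [model(items[j]) for j in test_idx]
        scores.append(f1_score([labels[j] for j in test_idx], predictions))
    return sum(scores) / k
```

In the iterative scenario, an estimate of this kind could be recomputed after each batch of crowd labels and compared against the target quality to decide whether to stop asking.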

4.3 Effect of Batch Sizes 

In this section, we study the relationship between the number of 
items we submit to the crowd at a time (the "batch size"), the overall
quality of the results, and the time needed to return all the answers 
to the user. Increasing batch size allows for more parallelism from 
the crowd (as several workers can label items at one time), and 
decreases training algorithm time (by performing fewer iterations 
in the iterative case), but decreases model quality as larger batches 
present fewer opportunities to learn from data as it arrives. 

The effect of batch size on result quality can be dramatic, as
shown in Figure 6. Here we show that the F1-measure gains can
be in the 8-10% range, which (as we explain in Section 5) is quite
significant.

Larger batch sizes also reduce runtime substantially, as shown in
Figure 7. Here, going from batch size 1 to 200 reduces the time to
train a model and collect data from the crowd by about two orders of
magnitude (from thousands of seconds to tens of seconds), which
could be significant.

An important question is how to pick the batch size. If we have no
constraints on runtime, clearly smaller batch sizes are preferable.
Given a constraint on runtime, however, the proper choice of batch
size is a more challenging task. The simplest choice would be to
pick the smallest batch size B that meets our time constraint.
However, observe that choosing a batch size B' larger than B could
reduce runtime, allowing us to label more data points, which could
increase the overall quality, perhaps more than reducing the batch
size from B' to B. One approach would be to formulate this
trade-off as an optimization function. Doing so, however, is difficult
because the relationship between result quality, the number of data
items labeled, and batch size depends on the nature of the data set
and the exact classification algorithm. Hence, we leave this modeling
exercise as future work, and for now assume that smaller batch
sizes are always preferable.

Figure 6: Effect of batch size on the F1-measure on the vehicle
dataset, for a fixed budget of 400 questions, using our two
algorithms (IterativeUncertainty and IterativeMinExpError). X axis:
batch size.

Figure 7: The effect of batch size on the processing time of our
active learners, for a fixed budget of 400 questions.

5. Experimental Results 

In this section, we present several experiments designed to
evaluate the effectiveness of our active learning algorithms (Section 3)
and crowd optimization techniques (Section 4). We consider
different criteria that have been used for evaluating active learning
techniques (described in Section 5) and compare our techniques against
a random baseline (that chooses items to have the crowd label at 
random) as well as several state-of-the-art algorithms. Our primary 
goal is to understand how much more quickly and / or accurately 
our active learning methods can label a dataset in comparison to a 
baseline that chooses items to have the crowd label at random. 

In Section 5.1, we report experiments we ran on Amazon
Mechanical Turk on entity resolution, vision, and sentiment analysis
tasks. Then, in Section 5.2, we provide our results on 15
well-known datasets from the UCI KDD repository. Finally, we study
the run-time and scalability of our algorithms in Section 5.3.

Experimental Setup. All the algorithms were implemented in
Matlab 2012a and tested on a Dell PowerEdge R710 server with
two quad-core Intel Xeon 2.4 GHz processors and 24GB of RAM,
running Ubuntu 10.10 with kernel version 2.6.35. Throughout this
section, unless stated otherwise, we have used the following
parameters: we repeated each experiment 20 times and reported the
average result, every HIT cost 1¢ for the worker and 0.5¢ for using
Amazon Mechanical Turk's service, and the size of the initial
training set and the batch size were respectively 0.03% and 10% of the
unlabeled set.

Methods Compared. We run experiments on five different learn- 
ing algorithms, in both the upfront and iterative scenarios. 

1. Uncertainty: The method of Section 3.1.1.

2. MinExpError: The method of Section 3.1.2.

3. MarginDistance: The method of [32, 31], which is specifically
designed for active learning with SVMs, and picks items that
are closer to the margin (see Section 6).

4. Bootstrap-LV: Another bootstrap-based active learning technique,
proposed by [28]. This method can only be applied to probabilistic
classifiers, so we do not include it in experiments with
SVMs (see Section 6).

5. Baseline: A naive method that randomly selects unlabeled 
items to send to the crowd. 

In the plots, we append the scenario name to the algorithm names, 
e.g. UpfrontMinExpError or IterativeBaseline. We have repeated 
our experiments with different classifiers as the underlying learner, 
including SVM, naive Bayesian classifier (NBC), neural networks, 
and decision trees. For lack of space, we only report the results of 
each experiment for one type of classifier. When not specified, we 
have used linear SVM. 

Evaluation Metrics. Active learning algorithms are typically
evaluated based on their learning curve, which plots the quality
measure of interest (e.g., accuracy or F1-measure) as a function of the
number of data items that are labeled [29]; for instance, see Figure 8.
To quantitatively compare different learning curves, the following
metrics are often used:

1. Area under curve (AUC) of the learning curve. 

2. AUCLOG which is the AUC of the learning curve when the 
X-axis is in log-scale. 

3. Quality lift which is the (average or maximum) vertical dis- 
tance between two learning curves. 

4. Cost reduction or the (average or maximum) horizontal dis- 
tance between two learning curves. 

The higher the AUC, the better, as it indicates achieving higher
quality for the same cost/number of questions. Due to the
diminishing-return shape of learning curves, the average quality lift is
usually in a range of up to 16%.

AUCLOG is another metric that is important in active learning,
where the intuition is to favor algorithms that improve the metric of
interest early on (e.g., with few examples). Due to the logarithm,
the improvement of this measure is typically in a range of up to 6%.

The cost reduction captured by the average horizontal distance
between two curves indicates the average cost or number of questions that
would be saved by one curve compared to the other, if the goal
was to achieve a given quality level, say reaching an F1-measure of
80%. For fairness, we compute this measure by taking an average
over all the quality values that are achievable by both curves. In
some cases, when the two curves do not have a range in common
and interpolation is not reasonable, this number is undefined. The
number of questions saved by a competent active learner should
be larger than 1x (potentially 100x or more), while a value < 1x
indicates a performance worse than a baseline that asks random
questions.

The AUC is equivalent to the average vertical distance of the two
curves when the budget range of both curves is the same. Therefore,
in this paper we just report the average (and sometimes maximum)
quality lift, e.g., the Avg. F1 lift. We also report the AUCLOG
and the average number of questions saved. Unless specified
otherwise, for each algorithm under study, we report the relative
improvement of these measures compared to a random baseline,
which is a learner that chooses its labeled data by randomly
choosing the same number of questions from the same pool of
unlabeled data.
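
The metrics above can be computed directly from sampled learning curves. The Python sketch below is illustrative only: it assumes each curve is a list of (number of questions, quality) points sorted by increasing, strictly positive budget, and the function names, the trapezoidal rule, and the linear interpolation are our choices rather than details taken from the experiments.

```python
from math import log


def auc(curve, log_x=False):
    """Area under a learning curve (trapezoidal rule).

    With log_x=True the X axis is log-scaled, giving the AUCLOG metric.
    """
    xs = [log(x) if log_x else x for x, _ in curve]
    ys = [y for _, y in curve]
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
               for i in range(len(xs) - 1))


def average_lift(curve, baseline):
    """Average vertical distance; both curves must share the same budgets."""
    return sum(y - y_base
               for (_, y), (_, y_base) in zip(curve, baseline)) / len(curve)


def questions_needed(curve, target):
    """Budget at which the curve first reaches `target`, or None if it never does."""
    prev_x, prev_y = curve[0]
    if prev_y >= target:
        return prev_x
    for x, y in curve[1:]:
        if y >= target:
            # Linear interpolation between the two surrounding points.
            return prev_x + (target - prev_y) * (x - prev_x) / (y - prev_y)
        prev_x, prev_y = x, y
    return None


def average_questions_saved(curve, baseline, levels):
    """Average ratio baseline/curve of questions needed, over the quality
    levels reachable by both curves; None when they share no level."""
    ratios = []
    for level in levels:
        mine = questions_needed(curve, level)
        base = questions_needed(baseline, level)
        if mine is not None and base is not None:
            ratios.append(base / mine)
    return sum(ratios) / len(ratios) if ratios else None
```

With these definitions, a questions-saved value above 1 means the baseline needs more questions than the active learner to reach the same quality, and `None` corresponds to the undefined case in which the two curves have no achievable quality level in common.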

Many real-world datasets are unbalanced, e.g., they contain many 
more negative labels than positive ones. For such datasets, accu- 
racy (i.e., the ratio of correctly classified items) is a poor measure 
of quality because a classifier that always predicts the majority la- 
bel has a high accuracy but no practical use. Moreover, precision 
and recall of a classifier can often be improved at the cost of the 
other, e.g., making the classifier predict more positive labels will 
increase recall but cause precision to drop. For these reasons, the
F1-measure of the minority class is often used as a more reliable
quality measure for classification tasks [33]. Therefore, in this
paper, due to space constraints, we generally report the F1-measure.
More details (e.g., individual experiments, precision, recall) can be
found in [24].
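
A small worked example makes the argument concrete. The Python sketch below uses made-up labels (5 positives among 100 items, which are not data from our experiments) to compare a classifier that always predicts the majority class with one that finds most of the minority class. The helper names are ours.

```python
def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label matches the true label."""
    return sum(1 for t, q in zip(y_true, y_pred) if t == q) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1-measure of the `positive` (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, q in pairs if t == positive and q == positive)
    fp = sum(1 for t, q in pairs if t != positive and q == positive)
    fn = sum(1 for t, q in pairs if t == positive and q != positive)
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical unbalanced dataset: 5 positive items followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Classifier A always predicts the majority (negative) label.
always_negative = [0] * 100

# Classifier B finds 4 of the 5 positives at the cost of 4 false positives.
finds_minority = [1, 1, 1, 1, 0] + [1] * 4 + [0] * 91
```

Both classifiers have an accuracy of 0.95, but the F1-measure of the minority class is 0 for the first and about 0.62 for the second (precision 0.5, recall 0.8), which is the distinction accuracy fails to capture.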

5.1 Crowd-sourced Datasets 

We experiment with several datasets labeled using Amazon Me- 
chanical Turk in this paper. In this section, we report the perfor- 
mance of our algorithms on each of them. 

5.1.1 Entity Resolution 

Entity resolution is the task of finding different records that refer 
to the same entity. Entity resolution (ER) is an essential step in data 
integration and cleaning, especially when data comes from multiple 
sources. Crowd-sourcing is typically a more accurate means for ER 
than machine learning, but also more expensive and slower [35].

We used the Product dataset, which contains product attributes
(name, description, and price) of items listed on the abt.com and
buy.com websites. The task is to detect identical items on the
two websites that are listed under different attributes. We used the
same crowd-sourced version of this dataset used in [35], where the
crowd has labeled 8315 pairs of these items as identical (12%) or
non-identical (88%), where each pair has been labeled by 3 different
workers, with an average accuracy of 89% and F1-measure
of 56%. We also used the same classifier
used in [35], namely a linear SVM where each pair of items is
represented by the Levenshtein and Cosine similarities of their names
and descriptions. When trained on 3% of the data, this classifier
has an average accuracy of 80% and an F1-measure of 40%.

The results are shown in Figure 8. All methods improve with
more questions, but Uncertainty improves more quickly than the 
others, as it is able to identify the data items about which the model 
has the most uncertainty and get the crowd to label them. Note that 
MarginDistance does only slightly better than the baseline. 

5.1.2 Image Detection 

Vision-related problems are another area in which crowd-sourcing 
is heavily utilized, e.g., in tagging pictures, identifying objects, and 
providing bounding boxes [34]. In all of our vision experiments,
we employed a relatively simple classifier where the PHOW features
(a variant of dense SIFT descriptors commonly used in vision
tasks [8]) of a set of images are first extracted as a bag of words,
and then a linear SVM is used for their classification. While this
is not the state-of-the-art image detection algorithm, we show that
even with this simple classifier, our active learning algorithms can
greatly reduce the cost of many challenging vision tasks.

Footnote (Product dataset): http://dbs.uni-leipzig.de/file/Abt-Buy.zip

Figure 8: The overall F1-measure for entity resolution in the
iterative scenario.

Gender Detection. We used the faces from the Caltech101 dataset [14]
and manually labeled each image with its gender (266 males, 169
females) as our ground truth. We also gathered crowd labels by
asking 5 different workers for the gender of each image. We
started by training the model on a random set of 11% of the data.
In Figure 9, we show the accuracy of the crowd, the accuracy of
our machine learning model, and also the overall accuracy of the
model plus crowd data. For instance, when a fraction $x$ of the
labels were obtained from the crowd, the other $1 - x$ labels were
determined from the model, and thus the overall accuracy was
$x \cdot a_c + (1 - x) \cdot a_m$, where $a_c$ and $a_m$ are the crowd's
and the model's accuracy, respectively. Similar to our entity resolution
experiments, our algorithms lift the quality of the labels that are
provided by the crowd, i.e., by asking questions for which the crowd
tends to be more reliable. In this case, though, the overall quality of
the crowd is much higher than in the entity resolution case, and
therefore the lift of the crowd's accuracy is only from 98.5% to 100%.
Figure 9 shows that both MinExpError and Uncertainty perform
well in the upfront scenario, respectively lifting the accuracy of the
baseline by 4% and 2% on average, and lifting its AUCLOG by 2-3%.
Here, due to the upfront scenario, MinExpError saves the largest
number of questions. The baseline has to ask 4.7x (3.7x) more
questions than MinExpError (Uncertainty) to achieve the same
accuracy. Again, MarginDistance, although specifically designed for
SVMs, achieves little improvement over the baseline.
Object Containment. We again mixed 50 human faces and 50
background images from Caltech101 [14]. Because telling human
faces from background clutter is easy for humans, we used the
ground truth as crowd labels in this experiment. Figure 10 shows
the upfront scenario with an initial set of 10 labeled images, where
both Uncertainty and MinExpError lift the baseline's F1-measure
by 16%, while MarginDistance provides a lift of 13%. All
three algorithms increase the baseline's AUCLOG by 5-6%. Note
that the baseline's F1-measure slightly degrades as it reaches higher
budgets. This is because some of the images are harder to classify:
the baseline eventually has to classify them, while the active
learning algorithms avoid them, leaving them in the last batch,
which will be handled by the crowd.

5.1.3 Sentiment Analysis 

Popular microblogging web-sites such as Twitter have become
rich sources of data for sentiment analysis (a.k.a. opinion mining) [25],
where a politician (or business) can ask questions such
as "how many of the tweets that mention Obama (or the iPhone)
have a positive or negative sentiment?". Training accurate classifiers
requires a sufficient amount of accurately labeled data, and
with hundreds of millions of daily tweets, asking the crowd to label all
of them is infeasible. In this experiment, we show that with our
active learning algorithms, with as little as 1K-3K crowd-labeled
tweets, we can achieve very high accuracy and F1-measure on a
corpus of 10K-100K unlabeled tweets.

Figure 10: The object inclusion task (whether a scene contains
a human or not). F1-measure of the model.

We randomly chose 100K tweets from an online corpus that
provides ground truth labels for the tweets, with equal numbers of
positive and negative-sentiment tweets. To obtain crowd labels, we
obtained labels (positive, negative, neutral, or vague/unknown) for
each tweet from 5 different workers. Figure 11(a) shows the results
for using 3K initially labeled data points in the 100K dataset in the
upfront setting. The results confirm that the upfront scenario is best
handled by our MinExpError algorithm. Here, the MinExpError,
Uncertainty, and MarginDistance algorithms improve the average
F1-measure of the baseline model by 11%, 9%, and 5%, respectively.
Also, MinExpError increases the baseline's AUCLOG by 4%.
All three active learning algorithms dramatically reduce the number of
questions required to achieve a given accuracy or F1-measure. In
comparison to the baseline, MinExpError, Uncertainty, and
MarginDistance reduce the number of questions by factors of 46x, 32x, and
27x, respectively.



Figure 11: The sentiment analysis task: F1-measure of the
model for 100K tweets in the upfront scenario. (Y axis: F1-measure of
the model; X axis: total number of questions asked, from 20,000 to
100,000.)

Figure 9: The object detection task (detecting the gender of the person in an image): accuracy of the (a) crowd, (b) model, (c) overall

5.2 UCI Classification Datasets



In Section 5.1, we validated our algorithms on crowd-sourced
datasets. In this section, we also compare our algorithms to
state-of-the-art techniques using well-known datasets from the UCI KDD
repository [1]. For these datasets, the labels are provided by
experts; that is, the ground truth and the crowd labels are the same.
Thus, by excluding the effect of an imperfect crowd, we can
compare different active learning strategies in isolation. We have
chosen 15 well-known datasets, as shown in Tables 2 and 3. In
general, the more competent the underlying classifier, the larger the
benefit of using active learning. Therefore, out of fairness and
to keep our results unbiased, we have avoided any dataset-specific
tuning or preprocessing steps, and we applied the same classifier with
the same settings to all datasets. For the same reason, we have
selected most of these datasets to only include numerical attributes
so that different discretization strategies do not affect the results,
since discretization might work for one dataset but not for the others.
In each case, we examined 10 different question-asking budgets of
10%, 20%, ..., 100%, each repeated 10 times, and have reported
the average. Also, to compute the F1-measure for datasets with more
than 2 classes, we have either put all but the majority class in a
single class, or have arbitrarily partitioned all the classes into two new
ones (details in [24]).

Here, other than the random baseline, we compare against two
other active learning techniques, namely MarginDistance and Bootstrap- 
LV. Bootstrap-LV is designed only for probabilistic classifiers i.e., 
when the classifier provides class probability estimates (CPE) as 
well as the predicted class label. Thus, for all learning techniques, 
we used MATLAB's implementation of decision trees as our clas- 
sifier. We used default parameters except for the following: no 
pruning, no leaf merging, and a 'minparent' of 1 (impure nodes 
with 1 or more observations can be split). 

Tables 2 and 3 show the results of these experiments for both the
upfront and iterative settings. We report all the measures of different
active learning algorithms in terms of their performance improvement
relative to the baseline (so higher numbers are better). For
instance, consider Table 2. Here, a reported number of 1.04 under
the fourth column of the cancer dataset means that the AUCLOG of
the F1-measure with the Uncertainty algorithm is on average 4%
higher than that with the baseline. Likewise, the 1.08 number in
the last column of this dataset means that the average F1-measure
of MinExpError is 10% higher than that of the baseline. As for the
'Avg. # of Questions Saved', a reported number of 14.83 under the
sixth column of the cancer dataset means that the baseline needs on
average 14.83x more questions than the MarginDistance algorithm
in order to achieve a given F1 level.

In summary, these results are consistent with those observed
with crowd-sourced datasets. In the upfront setting, MinExpError
is significantly beneficial and superior to other active learning
techniques, with more than 104x savings in the total number of
questions on average (this is the average of the average savings per
dataset, i.e., the maximum improvement for each dataset is much
higher than this). MinExpError also improves the AUCLOG and
average F1-measure of the baseline on average by 5% and 15%,
respectively. After MinExpError, Uncertainty and Bootstrap-LV
are most effective, with comparable performance, i.e., 55-69x
savings, improving the AUCLOG by 3%, and lifting the average
F1-measure by 11-12%. Least effective in the upfront scenario is
MarginDistance, which still provides around 13x savings.

For the iterative scenario, Uncertainty actually works better than
MinExpError, with an average saving of 7x over the baseline and
an increase in AUCLOG and average F1-measure of 1% and 3%,
respectively. This is twice the improvement of the previous
active learning techniques. The savings are in general more modest
than in the upfront case because, in the iterative setting, the
baseline receives much more labeled data and therefore its average
performance is much higher than in the upfront case, leaving less
room for improvement. However, given the comparable (and even
slightly better) performance of Uncertainty compared to MinExpError
in the iterative scenario, Uncertainty is the more favorable choice for
this scenario, considering that it incurs significantly less
processing overhead than MinExpError (see Section 5.3).



5.3 Run-time and Scalability 

To measure algorithm runtime, we experimented with multiple 
datasets but due to the similarity of the observed trends and lack 
of space, here we only report the results for the vehicle dataset. In 
Figure 7 (previously shown in Section 4), we can see that training
runtimes depend heavily on batch size (as the batch size determines 
how many times the model needs to be re-trained) and range from 
about 5,000 seconds to a few seconds. 

We also studied the effect of parallelism on our algorithms' run- 
time. Here we compared different active learning algorithms in the 
upfront scenario on the twitter dataset (10K tweets) as we enabled 
more cores on a multicore machine. The results are shown in Figure 12. 
Here, for Uncertainty, the run-time only improves until we have as 
many cores as we build bootstraps of the data (here, 10) and after 
that the improvement is marginal. On the other hand, MinExpError 
scales extremely well, achieving nearly linear speedup because it 
re-runs the model once for every training point. 
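
As a rough illustration of why the speedup flattens once the number of cores exceeds the number of independent tasks, the following sketch assumes that the work splits into equally sized, independent model-training tasks and that scheduling overhead is negligible. Both are simplifying assumptions, and the function name `ideal_speedup` and the task count of 1,000 are hypothetical.

```python
import math


def ideal_speedup(num_tasks: int, num_cores: int) -> float:
    """Speedup over a single core when num_tasks equal, independent tasks
    are spread over num_cores cores (overhead ignored by assumption)."""
    if num_tasks < 1 or num_cores < 1:
        raise ValueError("num_tasks and num_cores must be positive")
    rounds = math.ceil(num_tasks / num_cores)  # waves of parallel work
    return num_tasks / rounds


if __name__ == "__main__":
    for cores in (1, 2, 5, 10, 20, 40):
        # 10 tasks: one per bootstrap replicate.
        # 1000 tasks: a hypothetical stand-in for one task per training point.
        print(cores, ideal_speedup(10, cores), ideal_speedup(1000, cores))
```

Under this model, 10 tasks cannot be sped up beyond 10x no matter how many cores are added, while a workload with many more tasks keeps scaling with the core count.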

Table 2: Improvement of different active learning algorithms over the baseline for the upfront scenario. 
Table 3: Improvement of different active learning algorithms over the baseline for the iterative scenario. 

6. Related Work 

Crowd-sourcing. Several research groups have recently done work 
on integrating crowd-sourcing and human operators into data workflows 
and database systems [15, 19, 26, 23]. These crowd-enabled 
databases face limitations when dealing with large datasets. While 
these systems try to use DB techniques to reduce the amount of un- 
necessary work needed by humans (e.g., the number of pair-wise 
comparisons in a join query), in the end the crowd still has to pro- 
vide at least as many labels as there are unlabeled items that are 
queried by the user. It is simply not feasible to label millions of 
data items this way. Our algorithms are motivated by such large- 
scale datasets and aim to avoid obtaining crowd labels for a signifi- 
cant portion of the data by training and exploiting machine learning 
models. 

Active Learning. There has been a large body of work on active 
learning in the machine learning literature (see [29] for a survey). 
However, this field has traditionally dealt with situations where a 
moderate number of datapoints in a specific domain (e.g., medical 
diagnosis [17]) need to be labeled and where labels are obtained 
from highly trained experts (e.g., doctors). As a result, most 
of these techniques are domain or model-class specific, computationally 
expensive, and often assume perfect labels. There have 
been many domain-specific active learning techniques, e.g., in vision 
[34], entity resolution [4], and text classification [31], to name 
only a few. Our algorithms, however, work for general classification 
tasks and do not use or require any domain knowledge. 

Figure 12: The effect of parallelism on processing time. 

Focusing on items for which the learner is most uncertain has 
been used in many active learning approaches. However, some of 
these approaches are specific to a model class. For instance, a common 
SVM-based algorithm selects unlabeled items based on their 
distance to the SVM's margin, i.e., the proximity to the margin 
is treated as an indicator of the classifier's uncertainty in its 
prediction [31, 32]. In this paper, we have referred to this technique 
as MarginDistance and, as a representative of these model-specific 
techniques, we have compared it against our active learning algorithms, 
showing that ours (despite their generality) still achieve significantly 
better results, even for SVM classifiers. 
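
To make the margin-based selection rule concrete, the sketch below ranks unlabeled items by their geometric distance to the separating hyperplane of an already trained linear classifier and returns the closest ones. It is a simplified stand-in, not the implementation of [31, 32]: the weight vector `w`, the bias `b`, and both function names are assumptions introduced for illustration.

```python
import math
from typing import List, Sequence


def margin_distance(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Geometric distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0.0:
        raise ValueError("w must be non-zero")
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def pick_closest_to_boundary(
    w: Sequence[float], b: float, unlabeled: Sequence[Sequence[float]], k: int
) -> List[int]:
    """Indices of the k unlabeled items nearest the decision boundary,
    i.e., the items such a learner is assumed to be least certain about."""
    order = sorted(
        range(len(unlabeled)),
        key=lambda i: margin_distance(w, b, unlabeled[i]),
    )
    return order[:k]


if __name__ == "__main__":
    items = [[1.1, 5.0], [3.0, 0.0], [0.5, 2.0], [-4.0, 1.0]]
    print(pick_closest_to_boundary([1.0, 0.0], -1.0, items, k=2))
```

Note that this score requires access to the model's internal decision function, which is what makes the technique model-specific.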

Another group of active learning algorithms that have used uncertainty-based 
strategies have assumed a probabilistic classifier (i.e., one 
that provides class probability estimates) [34, 28]. Perhaps most 
relevant to our approach is that taken by [28], where the authors 
have also used bootstrap, in a technique called Bootstrap-LV that 
uses the model's class probability estimates to measure a notion of 
uncertainty. In Section 5 we showed that, in the iterative scenario, 
our Uncertainty algorithm performs better or comparably, while in 
the upfront scenario our MinExpError algorithm is significantly superior 
to Bootstrap-LV. Moreover, we do not assume probabilistic 
classifiers and use a simple notion of variance based on the classifier's 
class prediction. 
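
The following is a minimal sketch of a bootstrap-based uncertainty score that relies only on hard class predictions, not on class probability estimates. It is not the exact algorithm evaluated in Section 5. It assumes binary labels, a toy one-dimensional threshold learner (`train_threshold`), and k = 10 resamples; all names, and the guard that redraws single-class resamples, are illustrative assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]      # (feature value, label in {0, 1})
Model = Callable[[float], int]   # maps a feature value to a predicted label


def train_threshold(data: Sequence[Example]) -> Model:
    """Toy stand-in learner: threshold halfway between the two class means."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    if not pos or not neg:
        raise ValueError("training data must contain both classes")
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2.0
    if mean_pos >= mean_neg:
        return lambda x: 1 if x > threshold else 0
    return lambda x: 1 if x < threshold else 0


def bootstrap_uncertainty(
    labeled: Sequence[Example],
    unlabeled: Sequence[float],
    train: Callable[[Sequence[Example]], Model] = train_threshold,
    k: int = 10,
    seed: int = 0,
) -> List[float]:
    """Variance of the predicted label across k bootstrap-trained models."""
    if len({y for _, y in labeled}) < 2:
        raise ValueError("labeled data must contain both classes")
    rng = random.Random(seed)
    models: List[Model] = []
    while len(models) < k:
        # Non-parametric bootstrap: resample the labeled set with replacement.
        resample = [rng.choice(labeled) for _ in labeled]
        if len({y for _, y in resample}) < 2:
            continue  # illustrative guard: redraw single-class resamples
        models.append(train(resample))
    scores = []
    for x in unlabeled:
        p = sum(m(x) for m in models) / k  # fraction of models predicting 1
        scores.append(p * (1.0 - p))       # 0 when all models agree
    return scores


if __name__ == "__main__":
    labeled = [(0.0, 0), (0.2, 0), (0.4, 0), (1.6, 1), (1.8, 1), (2.0, 1)]
    print(bootstrap_uncertainty(labeled, [-5.0, 1.0, 7.0]))
```

Because the score depends only on the labels returned by `train`, any classifier can be plugged in, which is the sense in which a prediction-based variance is model-agnostic. The k training runs are also independent of one another and can be executed in parallel.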

Active learning in crowd-sourcing. Recently, a few other papers 
have tried to apply active learning in crowd-sourcing [2, 4, 35] for 
the specific problem of entity resolution. These solutions are typically 
only applicable to entity resolution, e.g., they use similarity 
metrics among pairs of items to detect pairs that are unlikely to be 
identical [35], or assume an imbalanced dataset and focus only on 
maximizing recall [2, 4]. Our algorithms, on the other hand, are quite 
general. Nevertheless, in Section 5.1.1 we applied Uncertainty and 
MinExpError to the same crowd-sourced dataset used in [35] and 
showed that our active learners can still improve the F1-measure by 
about 10% even for those pairs of items that would be sent to the 
crowd by [35]. 



Yan et al. [36] also look at the problem of active learning from 
a group of workers, but focus on the problem of picking the best 
worker to answer each question. This is different from our scenario 
because in crowdsourcing systems like Mechanical Turk, the crowd 
database has no control over which users answer a given item. Pujara 
et al. [27] also propose using active learning in a crowdsourced 
setting where they ask crowd workers to label the lowest-confidence 
data items. However, they simply use proximity to the decision 
boundary as a metric of confidence, which, as we showed in 
our experiments, does not perform much better than a random baseline. 
Filtering low-quality workers has been discussed in [12], while 
the effect of redundancy on accuracy has been studied in [30]. Our 
PBA algorithm improves on [30] by considering crowd error for 
different subgroups, while [30] assumes that the crowd error is 
independent of the item being classified. 

Batch-mode active learning. Batching has been used as a viable 
approach in active learning with large corpora of data [18]. This is 
in the same spirit as our iterative scenario. The problem of batch-size 
selection has been studied in the active learning community 
(e.g., by Guo et al. [16]). Adapting these algorithms to the crowd-sourced 
setting described in Section 4.3 would be an interesting 
direction for future work. Also, it has been shown that diversifying 
the items in each batch can improve the results [9]. Our algorithms 
implicitly diversify each batch through weighted sampling, which 
ensures that even low-score items have a chance to be labeled (as 
opposed to a top-K selection strategy). There has also been work 
on stopping criteria in the machine learning community [7], which 
we might be able to adapt to a crowd-sourcing scenario beyond the 
simple scheme presented in Section 4.2. 
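
The contrast between top-K selection and weighted sampling can be sketched as follows. The sketch assumes that each item's selection probability is proportional to its (strictly positive) score and that items are drawn without replacement; the text above does not fix these details, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Deterministically pick the k highest-scoring items."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


def weighted_sample(
    scores: Sequence[float], k: int, rng: random.Random
) -> List[int]:
    """Draw k distinct items; each draw picks a remaining item with
    probability proportional to its score (assumed strictly positive)."""
    if k > len(scores):
        raise ValueError("k cannot exceed the number of items")
    if any(s <= 0 for s in scores):
        raise ValueError("scores must be strictly positive")
    remaining = list(range(len(scores)))
    chosen: List[int] = []
    for _ in range(k):
        weights = [scores[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


if __name__ == "__main__":
    scores = [0.9, 0.5, 0.05, 0.02]
    rng = random.Random(1)
    print("top-K:   ", top_k(scores, 2))
    print("weighted:", [weighted_sample(scores, 2, rng) for _ in range(5)])
```

With top-K, the two low-score items in this example are never selected, whereas weighted sampling occasionally includes them, so successive batches cover more of the item space.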

7. Conclusions 

In this paper, we proposed an approach to integrate new active 
learning algorithms into crowd-sourced databases. Specifically, we 
proposed two active learning algorithms designed for two different 
settings. In the upfront setting, we ask the crowd all questions 
in one go. In the iterative setting, the questions to be asked of the 
crowd are adaptively picked and added to the labeled pool, which 
is followed by retraining the model and iterating over the process. 
While this is more expensive because of the iterative retraining, it 
also has a higher chance of learning a better model. We designed two 
algorithms, Uncertainty and MinExpError, based on the theory of 
non-parametric bootstrap, which makes them applicable to a broad 
range of machine learning models. We also proposed algorithms 
for choosing the number of questions to ask from different crowd- 
workers, based on the characteristics of the data being labeled, and 
studied the effect of batching on the overall runtime and quality 
of our active learning algorithms. Our results, on three data sets 
collected with Amazon's Mechanical Turk, and with 15 data sets 
from the UCI KDD archive, show that compared to choosing items 
to label at random, our algorithms make 8x fewer label requests 
than existing active learning techniques in the upfront scenario, and 
6x fewer label requests in the iterative scenario. We believe that 
these algorithms would prove to be immensely useful in crowd- 
sourced database systems. 